0.1 Report Set-up

Load all required packages for this report and define some global customization settings.

# for knitting the .rmd file to .html
if (! requireNamespace("knitr", quietly = TRUE)) {
  install.packages("knitr")
}
if (! requireNamespace("kableExtra", quietly = TRUE)) {
  install.packages("kableExtra")
}

# for creating models and analyzing expression data
if (! requireNamespace("edgeR", quietly = TRUE)) {
  BiocManager::install("edgeR")
}
if (! requireNamespace("limma", quietly = TRUE)) {
  BiocManager::install("limma")
}
if (! requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
if (! requireNamespace("gprofiler2", quietly = TRUE)) {
  install.packages("gprofiler2")
}

# for creating plots
if (! requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
if (! requireNamespace("ggrepel", quietly = TRUE)) {
  install.packages("ggrepel")
}
if (! requireNamespace("ComplexHeatmap", quietly = TRUE)) {
  BiocManager::install("ComplexHeatmap")
}
if (! requireNamespace("circlize", quietly = TRUE)) {
  BiocManager::install("circlize")
}

library(edgeR)
library(dplyr)
library(ggplot2)

# wraps lines that are too long
knitr::opts_chunk$set(tidy.opts=list(width.cutoff=80), tidy=TRUE)
# set default behaviors for all chunks
knitr::opts_chunk$set(warning = FALSE, message = FALSE)

All table and figure outputs in this report are rendered using the knitr package [1], and kableExtra package [2] for styling. Some codes in this report are adapted from Lecture 6 - Differential Expression [3] and Lecture 7 - Annotation Resources and Prelim ORA [4].


1 Introduction

1.1 Background on the Data Set

Glucocorticoids (GC) is a class of steroid hormones that plays a role in regulating the immune system and certain aspects of the immune function, such as inflammation. The most common naturally-produced GC hormone in humans is cortisol, which are produced by the adrenal glands. Due to GC’s potent anti-inflammatory effects, several synthetic forms of GC are used as immunosuppressive drugs to treat various medical conditions such as asthma, allergies, autoimmune disorders, and cancer. [5]

The data set used in this report comes from Quatrini et al. (2022)’s paper: Glucocorticoids inhibit human hematopoietic stem cell differentiation toward a common ILC precursor [5], where they analyzed the the role of GC on regulating innate lymphoid cells (ILCs) differentiation, including cytotoxic natural killer cells and helper ILCs, from human hematopoietic stem cells (HSCs). The RNA-seq data set used in this report is a part of the overall study; it evaluates the effect of the presence of Dexamethasone (DEX), a glucocorticoid medication, on gene expressions of HSCs. The raw data set is downloaded from the NCBI Gene Expression Omnibus, with ID GSE186950. [6]

1.2 Data Processing and Normalization

In Assignment 1, I performed initial pre-processing and normalization of the data set. The raw data set contains 2 groups, control and conditioned (DEX), with 3 samples in each group: AF25, AF26, AF29. The data set initially contained gene expressions of 58683 genes, but 46243 genes were removed due to low counts with less than 1 count per million in at least 3 samples, leaving 12440 genes.

In the original data set, genes are labeled in a mix of HUGO gene symbols, GenBank accession IDs, EMBL identifiers, etc. Using the biomaRt package [7], we attempted to map non-HUGO identifiers to their corresponding HUGO gene symbols, but had to discard 847 (6.8%) genes with no matches, leaving a final set of 11593 unique genes.

knitr::include_graphics("figures/A2/raw_fltr_data_cpm.png")

Figure 1. log2 CPM Distributions of gene expressions of each sample in the data set, after filtering out genes with low-counts or without HUGO gene symbol match, and before TMM normalization. This figure is adapted from Assignment 1 with slight aesthetic modifications.

As shown in Fig.1, The filtered data set is approximately normally distributed with no outlier samples, and therefore we normalized the data using the Trimmed Mean of M-values (TMM) method, via the edgeR package. [8]

Since we’ll be using functions from the edgeR package to build the differential expression models in this report, the DGEList object is needed. Therefore, we’ll load the filtered data set (in counts), which excludes the removed genes as outlined above, via an .rds file, and then repeat the TMM-normalization process. The raw_fltr_matrix.rds file can be generated by running saveRDS(raw_fltr_matrix, 'raw_fltr_matrix.rds') after running all code chunks in a1.Rmd.

raw_fltr_matrix <- readRDS("raw_fltr_matrix.rds")

# group each sample into either the control (CTRL) or the conditioned (DEX)
# group.
groups <- data.frame(matrix(NA, nrow = 6, ncol = 2))
for (i in colnames(raw_fltr_matrix)) {
    idx <- match(i, colnames(raw_fltr_matrix))
    if (grepl("DEX", i)) {
        groups[idx, 1] <- "DEX"
        groups[idx, 2] <- substring(i, 1, 4)
    } else {
        groups[idx, 1] <- "CTRL"
        groups[idx, 2] <- substring(i, 1, 4)
    }
}
rownames(groups) <- colnames(raw_fltr_matrix)
colnames(groups) <- c("treatment", "sample")
kableExtra::kable_paper(knitr::kable(groups, format = "html"))
treatment sample
AF25 CTRL AF25
AF26 CTRL AF26
AF29 CTRL AF29
AF25DEX DEX AF25
AF26DEX DEX AF26
AF29DEX DEX AF29

Table 1. An Overview of the groups in the data set. The data set contains two factors: 1) Treatment, where each sample is catogorized into either the control (CTRL) or the conditioned (DEX); 2) Sample, where each sample belongs to one of the three biological entities: AF25, AF26, and AF29.

# Normalization via the TMM method
dge <- edgeR::DGEList(counts = raw_fltr_matrix, group = groups$treatment)

# calculate normalization factor
dge <- calcNormFactors(dge)
# get the normalized data
normalized_cpm <- cpm(dge)

# A preview of the normalized data.
kableExtra::kable_paper(knitr::kable(data.frame(head(normalized_cpm)), format = "html"))
AF25 AF26 AF29 AF25DEX AF26DEX AF29DEX
A1BG-AS1 8.844446 6.115007 7.757708 6.53587 6.690002 5.139955
A2M 82.753247 86.173319 117.025851 69.00011 61.464389 45.698876
AAAS 49.990344 35.644053 41.346933 43.04337 52.579231 44.203616
AACS 42.991696 31.138258 28.720025 47.99196 42.021572 35.699326
AAGAB 47.836914 42.965969 34.001869 55.74163 50.279542 46.166145
AAK1 51.220876 52.782164 50.920275 55.64826 44.739385 48.689395

Table 2. A preview of the normalized data set, in counts per million (CPM). Each row represents a unique gene as identified by their HUGO gene symbol in the row name, and each column represents a sample in the data set.


2 Differential Gene Expression

2.1 Global Overview of Differential Gene Expression

To start, we first create a heat map of gene expressions across all genes and all samples in the data set. This allows us to get an rough idea of the expression change patterns at-large. All heatmaps in this report are generated using the ComplexHeatmap package [9] and the circlize package [10] for generating interpolated colours.

# Create a row-normalized heatmap matrix
heatmap_matrix <- t(scale(t(normalized_cpm)))

The range of values in our heatmap matrix is [-2.0352289, 2.041092].
The two ends are approximately the same in magnitude, and so we use 0 as the middle point and set the color scheme to be “blue - white - red” where blue corresponds to lower gene expressions and red corresponds to higher gene expressions.

heatmap_col = circlize::colorRamp2(c(min(heatmap_matrix), 0, max(heatmap_matrix)),
    c("blue", "white", "red"))

# Create the heatmap
gene_express_heatmap <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix), show_row_dend = TRUE,
    show_column_dend = TRUE, col = heatmap_col, show_column_names = TRUE, show_row_names = FALSE,
    show_heatmap_legend = TRUE, name = "row-normalized gene \nexpression (CPM)",
    km = 3, column_title = "Row-Normalized Differential Gene Expressions (in CPM) In Each Sample")
gene_express_heatmap

Figure 2. Heatmap of row-normalized gene expressions, in counts-per-million (CPM), in each sample of the data set. Column-wise, each sample is hierarchically clustered based on the similarity of their gene expressions. Row-wise, each gene is hierarchically clustered based on their expressions across the samples. A k-mean clustering of size 3 is applied to the rows to cluster genes into 3 groups based on similarity in expressions.

In Fig.2, column-wise dendrogram shows that the six samples were cleanly clustered into two groups, control (CTRL) and conditioned (DEX). Row-wise, although the clustering pattern between individual genes are too difficult to identify, the three main groups as clustered by k-means do express some distinct differences between each other, andto an extent, between the control and conditioned group in each cluster.

An interesting note is that between the three biological entities (AF25, AF26, AF29), AF29 exhibits some unique expression patterns that differs from AF25 and AF26, in both the control and conditioned samples. The original paper where the data set came from did not have mentioning nor explanations to this difference, so it is likely that the difference is purely based on biological variability between the samples. Nevertheless, to verify, we built a multidimensional scaling (MDS) plot using the limma package [11] to illustrate the distances between samples.

limma::plotMDS(heatmap_matrix, col = c("blue", "red")[factor(groups$treatment)],
    main = "Distance Between Samples")

Figure 3. Multidimensional scaling (MDS) plot of samples in the data set. Distances on the plot are approximations of the typical log2 fold changes between the samples. Samples in the control (CTRL) group are colored blue and samples in the conditioned (DEX) group are colored red.

The MDS plot (Fig.3) supports the clustering patterns found in the heat map, both between the groups and between samples in each group. Namely, samples in the control group are clustered away from the conditioned group, and with in each of the two main clusters, AF29 are positioned farther away from AF25 and AF26. This aligns with the unqiue gene expression profiles of AF29 that observed in the heat map.

2.2 Model Design

2.2.1 Model #1: Treatment-Only Model Using Exact Test Model From edgeR

Without accounting for the biological variabilities between samples, our data set is quite simple with only two groups: control and conditioned. Therefore, we’ll first use the Exact Test model from the edgeR package [8] to test for the differential gene expressions between the two groups.

model_cond <- model.matrix(~groups$treatment + 0)
colnames(model_cond) <- c("CTRL", "DEX")
rownames(model_cond) <- colnames(normalized_cpm)
kableExtra::kable_paper(knitr::kable(model_cond, format = "html"))
CTRL DEX
AF25 1 0
AF26 1 0
AF29 1 0
AF25DEX 0 1
AF26DEX 0 1
AF29DEX 0 1

Table 3. Design of Model #1. In this model, only the treatment condition is considered and each sample belongs to either CTRL or DEX (conditioned).

dge_cond <- edgeR::estimateDisp(dge, model_cond)

et <- edgeR::exactTest(dge_cond, pair = c("CTRL", "DEX"))

top_et <- topTags(et, sort.by = "PValue", n = nrow(normalized_cpm))
kableExtra::kable_paper(knitr::kable(top_et$table[1:10, ], format = "html", digits = 150))
logFC logCPM PValue FDR
DAAM2 12.219507 7.260689 2.995010e-129 3.472115e-125
ADAMTS2 8.592964 6.707461 1.025497e-77 5.944291e-74
LPL 4.320328 6.105950 1.115675e-55 4.311341e-52
IL1R2 7.398576 7.941220 4.001718e-55 1.159798e-51
TIFAB -4.210233 5.021435 2.691740e-53 6.241069e-50
MRC2 -5.582371 5.082934 3.562996e-53 6.884302e-50
AIG1 3.172247 6.159028 2.335066e-50 3.859080e-47
FCER1A -4.956970 6.061634 2.663041e-50 3.859080e-47
MAOA 7.441233 5.763348 2.451648e-47 3.157995e-44
TPST1 4.763426 5.550884 2.007750e-46 2.327585e-43

Table 4. A preview of the top 10 differentially-expressed genes (in order of ascending p-value) under the treatment-only exact test model.

2.2.1.1 Model #1 Volcano Plot

To visualize the set of differentially expressed genes, we’ll use a Volcano plot. A portion of the following code is adapted from this tutorial [12] in terms of customization and gene label placements. The %>% shortcut from the dplyr package [13] was used for data manipulation, and the ggplot2 package [14] and the ggrepel package [15] were used to generate the Volcano plots.

top_et_label <- top_et$table %>%
    mutate(Expression = case_when(logFC > 0 & FDR < 0.05 ~ "Up-regulated", logFC <
        0 & FDR < 0.05 ~ "Down-regulated", TRUE ~ "Unchanged"))

# label the top five genes in either direction of change
top <- 5
top_genes_et <- bind_rows(top_et_label %>%
    filter(Expression == "Up-regulated") %>%
    arrange(FDR, desc(abs(logFC))) %>%
    head(top), top_et_label %>%
    filter(Expression == "Down-regulated") %>%
    arrange(FDR, desc(abs(logFC))) %>%
    head(top))

volplot_et <- ggplot2::ggplot(top_et_label, aes(logFC, -log(FDR, 10))) + geom_point(aes(color = Expression),
    size = 0.6) + xlab(expression("log"[2] * "FC")) + ylab(expression("-log"[10] *
    "FDR")) + scale_color_manual(values = c("dodgerblue3", "gray50", "firebrick3")) +
    guides(colour = guide_legend(override.aes = list(size = 4)))
volplot_et <- volplot_et + ggrepel::geom_label_repel(data = top_genes_et, mapping = aes(logFC,
    -log(FDR, 10), label = rownames(top_genes_et)), size = 3.5) + ggtitle("Volcano Plot of Differentially Expressed Genes Distribution in Model #1")
volplot_et

Figure 4. Volcano plot of differentially expressed genes in Model #1, accounting for only the treatment variability using the Exact Test Model from edgeR. Differentially expressed genes are coloured based on direction of change with blue indicating down-regulated genes and red indicating up-regulated genes (FDR-corrected p<0.05). Genes with the top five log fold change (in magnitude) in either directions are labelled.

2.2.1.2 Model #1 Heat Map

top_et_genes <- top_et[top_et$table$FDR < 0.05, ]
top_et_genes <- rownames(top_et_genes$table)

heatmap_matrix_tophits_et <- heatmap_matrix[which(rownames(heatmap_matrix) %in% top_et_genes),
    ]

heatmap_col = circlize::colorRamp2(c(min(heatmap_matrix_tophits_et), 0, max(heatmap_matrix_tophits_et)),
    c("blue", "white", "red"))

# Set colours
treat_colours <- c("orange", "darkblue")
names(treat_colours) <- c("DEX", "CTRL")

# Set annotations
treatment_col <- ComplexHeatmap::HeatmapAnnotation(df = data.frame(treatment = rep(c("CTRL",
    "DEX"), c(3, 3))), col = list(treatment = treat_colours))

model_1_heatmap <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix_tophits_et),
    cluster_rows = TRUE, cluster_columns = FALSE, show_row_dend = TRUE, show_column_dend = FALSE,
    col = heatmap_col, show_column_names = TRUE, show_row_names = FALSE, show_heatmap_legend = TRUE,
    top_annotation = treatment_col, column_title = "Annotated Heatmap of Model #1",
    name = "row-normalized gene \nexpression (CPM)", )
model_1_heatmap

Figure 5. Annotated heatmap of differentially expressed genes in Model #1, comparing conditioned to control samples. Samples are annotated based on treatment conditions.

2.2.2 Model #2: Treatment + Sample Model Using Quasi-Liklihood Model From edgeR

Recall that in the global heatmap, we observed some differences in terms of gene expressions between the three biological entities. Although the paper did not mention this difference, we’ll build another model to see how sample variability affect out results. We use the Quasi-Likelihood Model from the edgeR package [8], and include both the treatment and sample conditions.

model_samp <- model.matrix(~groups$sample + groups$treatment)
rownames(model_samp) <- colnames(normalized_cpm)
kableExtra::kable_paper(knitr::kable(model_samp, format = "simple"))
(Intercept) groups$sampleAF26 groups$sampleAF29 groups$treatmentDEX
AF25 1 0 0 0
AF26 1 1 0 0
AF29 1 0 1 0
AF25DEX 1 0 0 1
AF26DEX 1 1 0 1
AF29DEX 1 0 1 1

Table 5. Design of Model #2. In this model, both the treatment and sample condition are considered and each sample belongs to one of (CTRL, DEX) as well as one of (AF25, AF26, AF29).

# Estimate dispersion
dge_samp <- edgeR::estimateDisp(dge, model_samp)

# Fit the Quasi Likelihood Model
fit <- edgeR::glmQLFit(dge_samp, model_samp)
qlf <- edgeR::glmQLFTest(fit, coef = "groups$treatmentDEX")
top_qlf <- topTags(qlf, sort.by = "PValue", n = nrow(normalized_cpm))

kableExtra::kable_paper(knitr::kable(top_qlf$table[1:10, ], format = "html", digits = 20))
logFC logCPM F PValue FDR
DAAM2 12.214339 7.258871 1107.3521 3.2000e-19 3.690360e-15
ADAMTS2 8.644263 6.704992 881.9582 3.0800e-18 1.788146e-14
IL1R2 7.360969 7.940311 841.6158 4.9200e-18 1.899649e-14
HLA-DQA1 -7.099430 6.611630 679.4327 4.1110e-17 1.191407e-13
MAOA 7.404167 5.759322 661.2241 5.3780e-17 1.246887e-13
MFGE8 4.399029 7.609859 619.0729 1.0307e-16 1.991496e-13
HLA-DPB1 -5.698476 7.211961 545.8348 3.5611e-16 5.897655e-13
TMIGD3 5.491745 5.200044 535.7945 4.2736e-16 6.078592e-13
CYP4F3 5.637621 5.289852 530.4112 4.7190e-16 6.078592e-13
CD74 -3.704543 11.448976 519.9534 5.7374e-16 6.651424e-13

Table 6. A preview of the top 10 differentially-expressed genes (in order of ascending p-value) under the treatment+sample QLM model.

2.2.2.1 Model #2 Volcano Plot

top_qlf_label <- top_qlf$table %>%
    mutate(Expression = case_when(logFC > 0 & FDR < 0.05 ~ "Up-regulated", logFC <
        0 & FDR < 0.05 ~ "Down-regulated", TRUE ~ "Unchanged"))

# label the top five genes in either direction of change
top <- 5
top_genes_qlf <- bind_rows(top_qlf_label %>%
    filter(Expression == "Up-regulated") %>%
    arrange(FDR, desc(abs(logFC))) %>%
    head(top), top_qlf_label %>%
    filter(Expression == "Down-regulated") %>%
    arrange(FDR, desc(abs(logFC))) %>%
    head(top))

volplot_qlf <- ggplot2::ggplot(top_qlf_label, aes(logFC, -log(FDR, 10))) + geom_point(aes(color = Expression),
    size = 0.6) + xlab(expression("log"[2] * "FC")) + ylab(expression("-log"[10] *
    "FDR")) + scale_color_manual(values = c("dodgerblue3", "gray50", "firebrick3")) +
    guides(colour = guide_legend(override.aes = list(size = 4)))
volplot_qlf <- volplot_qlf + ggrepel::geom_label_repel(data = top_genes_qlf, mapping = aes(logFC,
    -log(FDR, 10), label = rownames(top_genes_qlf)), size = 3.5, max.overlaps = Inf) +
    ggtitle("Volcano Plot of Differentially Expressed Genes Distribution in Model #2")
volplot_qlf

Figure 6. Volcano plot of differentially expressed genes in Model #2, accounting for both the treatment and sample variability using the Quasi-Liklihood Model from edgeR. Differentially expressed genes are coloured based on direction of change with blue indicating down-regulated genes and red indicating up-regulated genes (FDR-corrected p<0.05). Genes with the top five log fold change (in magnitude) in either directions are labelled.

2.2.2.2 Model #2 Heat Map

# Subset the threshold gene
top_qlf_genes <- top_qlf[top_qlf$table$FDR < 0.05, ]
top_qlf_genes <- rownames(top_qlf_genes$table)

heatmap_matrix_tophits_qlf <- heatmap_matrix[which(rownames(heatmap_matrix) %in%
    top_qlf_genes), ]

heatmap_col = circlize::colorRamp2(c(min(heatmap_matrix_tophits_qlf), 0, max(heatmap_matrix_tophits_qlf)),
    c("blue", "white", "red"))

# Set colours
treat_colours <- c("orange", "darkblue")
names(treat_colours) <- c("DEX", "CTRL")

samp_colors <- hcl.colors(3, palette = "set 2")
names(samp_colors) <- c("AF25", "AF26", "AF29")

# Set annotations
annot_col <- ComplexHeatmap::HeatmapAnnotation(df = data.frame(treatment = rep(c("CTRL",
    "DEX"), c(3, 3)), sample = c("AF25", "AF26", "AF29", "AF25", "AF26", "AF29")),
    col = list(treatment = treat_colours, sample = samp_colors))

model_2_heatmap <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix_tophits_qlf),
    cluster_rows = TRUE, cluster_columns = FALSE, show_row_dend = TRUE, show_column_dend = FALSE,
    col = heatmap_col, show_column_names = TRUE, show_row_names = FALSE, show_heatmap_legend = TRUE,
    top_annotation = annot_col, column_title = "Annotated Heatmap of Model #2", name = "row-normalized gene \nexpression (CPM)",
    )
model_2_heatmap

Figure 7. Annotated heatmap of differentially expressed genes in Model #2, comparing conditioned to control samples as well as between the three biological entities. Samples are annotated based on treatment and sample conditions.

2.2.3 Comparisons Between the Two Models

model_1 <- data.frame(rownames(top_et$table), top_et$table$FDR)
model_2 <- data.frame(rownames(top_qlf$table), top_qlf$table$FDR)

# Merge the two model
merged_model <- merge(model_1, model_2, by.x = 1, by.y = 1)
colnames(merged_model) <- c("gene", "model_1_fdr", "model_2_fdr")


merged_model$colour <- "gray60"
merged_model$colour[merged_model$model_1_fdr < 0.05] <- "orange"
merged_model$colour[merged_model$model_2_fdr < 0.05] <- "dodgerblue3"
merged_model$colour[merged_model$model_1_fdr < 0.05 & merged_model$model_2_fdr <
    0.05] <- "firebrick3"

# Output comparison plot
plot(merged_model$model_1_fdr, merged_model$model_2_fdr, col = merged_model$colour,
    xlab = "Model #1 p-values (FDR-corrected)", ylab = "Model #2 p-values (FDR-corrected)",
    main = "Distribution of FDR-Corrected p-values for Genes in Model #1 and #2")
legend("topleft", legend = c("Model #1", "Model #2", "Both", "Not signif"), fill = c("orange",
    "dodgerblue3", "firebrick3", "gray60"))

Figure 8. Comparisons between Model #1 and #2 in terms of FDR-corrected p-value distribution.

From Fig.8, it is evident that Model #2 (treatment & sample variability) outputted more genes that are significantly differentially expressed. The final choice of model for over-representation analysis is explained in Model Choice.

2.3 Differential Expression Questions

1. Calculate p-values for each of the genes in your expression set. How many genes were significantly differentially expressed? What thresholds did you use and why?

I set the p-value threshold to be 0.05, where genes with p<0.05 are considered significantly differentially expressed. This threshold aligns with what the authors of this data set used in their paper [5], and it is also one of the mainstream choices in gene expression analyses that balances between the risk of false positives (picking up biologically-insignificant genes) and and false negatives (omitting biologically-significant genes).

  • In Model #1 (Treatment-Only Exact Test Model), 2808 genes pass the p-value threshold.

  • In Model #2 (Treatment + Sample Quasi-Liklihood Model), 4100 genes pass the p-value threshold.

Details of the p-value calculation for each model can be found in Model Design.

2. Multiple hypothesis testing - correct your p-values using a multiple hypothesis correction method. Which method did you use? And Why? How many genes passed correction?

I used Benjamini-Hochberg FDR method for multiple hypothesis correction, with p-value(FDR)<0.05 indicating a significant differential expression. This is automatically calculated as a part of the exactTest and glmQLFTest function [8]. The BH-FDR method is chosen because we wish to control false positive results that are more likely to occur the more testings that we do. This method assumes that the probability of detecting a significant discovery is equally likely among each of the tests, which is a reasonable assumption for each genes in our model, and is commonly used for other similar gene expression analyses. [16]

  • In Model #1 (Treatment-Only Exact Test Model), 1658 genes pass the p-value threshold after FDR correction.

  • In Model #2 (Treatment + Sample Quasi-Liklihood Model), 2945 genes pass the p-value threshold after FDR correction.

3. Show the amount of differentially expressed genes using an MA Plot or a Volcano plot. Highlight genes of interest.

See Model #1 Volcano Plot and Model #2 Volcano Plot. The publication where my data set comes from did not specify any specific genes of interest, and therefore I highlighted the top 5 genes (with the greatest log fold change value) in both the up-regulated and down-regulated path.

4. Visualize your top hits using a heatmap. Do you conditions cluster together? Explain why or why not.

See Model #1 Heat Map and Model #2 Heat Map. In both models, samples in each treatment conditions (CTRL vs. DEX) do cluster together among themselves, as shown in the clear separation between the two colors in the top annotation bar labeled ‘treatment’. Additionally, in Model #2’s heatmap, as shown in the column dendrogram and in the heatmap matrix, AF25 seems to cluster with AF26, while AF29 is more distinct compared to the rest in both the CTRL and DEX sample. These findings align with the pattern we saw in the global heatmap from Global Overview of Differential Gene Expression, and confirms that the presence of dexamethasone do have some impact on the expressions of certain genes.


3 Thresholded Over-Representation Analysis

3.1 Model Choice

In the last section, we identified the set of genes that are significantly up-regulated or down-regulated in each of our two models. Comparing Model #1 and #2, the top hits (with greatest logFC magnitudes) are mostly consisted of the same set of genes in slightly different orders. This is also evident in Fig.8 Comparisons Between the Two Models where the genes with p-value (FDR) < 0.05 in both models clustered in the bottom left corner in red. Although Model #2 returns more genes that are significantly differentially expressed (p-value (FDR) < 0.05, model #1: 1658, model #2: 2945), the p-values of its top hits are generally much larger than that of Model #1. In addition, comparing the heat maps of the two models, the significant set of genes in model #1 show a clearer distinction between the CTRL and DEX clusters than model #2. Lastly, Model #1 already contain a substantial set (1658) of significantly differentially-expressed genes, and adding more genes into our analyses may introduce more noise than biologically-relevant discoveries. Thus, we’ll do the subsequent analyses with Model #1.

In Fig.5 Model #1 Heat Map, it is indicated that the presence of dexamethasone do have some impact on gene expressions of certain genes. In this section, we’ll use g:Profiler [17] to run a thresholded gene set enrichment analysis on the significant genes to see which pathways are these genes involved in. We’ll use the gprofiler2 package [18] to directly access g:Profiler from R.

de_genes <- top_et$table

# extract genes that are up-regulated and down-regulated.
de_genes <- de_genes[which(de_genes$FDR < 0.05 & abs(de_genes$logFC) > log(2)), ]
upreg_genes <- de_genes[which(de_genes$logFC > log(2)), ]
downreg_genes <- de_genes[which(de_genes$logFC < -log(2)), ]
# Obtain data source version information
version_info <- gprofiler2::get_version_info(organism = "hsapiens")
version_info <- version_info$sources

data_sources <- c("GO:BP", "KEGG", "REAC", "WP")
version_info <- version_info[data_sources]
ver_info_df <- data.frame(sapply(version_info, `[[`, 2))
colnames(ver_info_df) <- c("Version")

# the output table of version information can be found in section 'Thresholded
# Over-Representation Analysis Questions'

3.2 Enrichment Analysis With All Differentially Expressed Genes

gp_de <- gprofiler2::gost(query = rownames(de_genes), organism = "hsapiens", sources = c("GO:BP",
    "KEGG", "REAC", "WP"), significant = TRUE, user_threshold = 0.05, correction_method = "fdr",
    exclude_iea = TRUE)

Interactive Manhattan Plot of g:Profiler Enrichment Analysis of All Significant Differentially Expressed Genes

Figure 9. Interactive Manhattan plot of g:Profiler output of all significant differentially expressed genes, including both up-regulated and down-regulated, thresholded based on p-value (FDR) < 0.05 and abs(logFC)>log(2). The enrichment analysis is done from 4 data sources: GO:BP, KEGG, REAC, WP, as labeled by color.

top_n <- 1:2


gp_de_result <- gp_de$result[order(gp_de$result$p_value, decreasing = FALSE), ]
gp_de_result <- subset(gp_de_result, select = c("term_id", "source", "term_name",
    "p_value"))
top_terms_de_1 <- gp_de_result[which(gp_de_result$source == "GO:BP"), ][top_n, ]
top_terms_de_2 <- gp_de_result[which(gp_de_result$source == "KEGG"), ][top_n, ]
top_terms_de_3 <- gp_de_result[which(gp_de_result$source == "REAC"), ][top_n, ]
top_terms_de_4 <- gp_de_result[which(gp_de_result$source == "WP"), ][top_n, ]

top_terms_de <- rbind(top_terms_de_1, top_terms_de_2, top_terms_de_3, top_terms_de_4)
row.names(top_terms_de) <- NULL
kableExtra::kable_paper(knitr::kable(top_terms_de, format = "html", digits = 50))
term_id source term_name p_value
GO:0002376 GO:BP immune system process 9.871125e-28
GO:0023052 GO:BP signaling 2.926590e-22
KEGG:04640 KEGG Hematopoietic cell lineage 1.652865e-05
KEGG:05310 KEGG Asthma 1.142178e-04
REAC:R-HSA-168256 REAC Immune System 7.353197e-15
REAC:R-HSA-6798695 REAC Neutrophil degranulation 6.701769e-09
WP:WP4884 WP Pathogenesis of SARS-CoV-2 mediated by nsp9-nsp10 complex 4.206349e-03
WP:WP2882 WP Nuclear receptors meta-pathway 2.337374e-02

Table 7. Top two pathways (ordered by ascending p-values) returned by each of the four data sources in differentially expressed genes.

3.3 Enrichment Analysis With Up-Regulated Genes Only

gp_upreg <- gprofiler2::gost(query = rownames(upreg_genes), organism = "hsapiens",
    sources = c("GO:BP", "KEGG", "REAC", "WP"), significant = TRUE, user_threshold = 0.05,
    correction_method = "fdr", exclude_iea = TRUE)

Interactive Manhattan Plot of g:Profiler Enrichment Analysis of All Significant Up-Regulated Genes

Figure 10. Interactive Manhattan plot of g:Profiler output of all significant up-regulated genes, thresholded based on p-value (FDR) < 0.05 and logFC>log(2). The enrichment analysis is done from 4 data sources: GO:BP, KEGG, REAC, WP, as labeled by color.

gp_upreg_result <- gp_upreg$result[order(gp_upreg$result$p_value, decreasing = FALSE),
    ]
gp_upreg_result <- subset(gp_upreg_result, select = c("term_id", "source", "term_name",
    "p_value"))
top_terms_upreg_1 <- gp_upreg_result[which(gp_upreg_result$source == "GO:BP"), ][top_n,
    ]
top_terms_upreg_2 <- gp_upreg_result[which(gp_upreg_result$source == "KEGG"), ][top_n,
    ]
top_terms_upreg_3 <- gp_upreg_result[which(gp_upreg_result$source == "REAC"), ][top_n,
    ]
top_terms_upreg_4 <- gp_upreg_result[which(gp_upreg_result$source == "WP"), ][top_n,
    ]

top_terms_upreg <- rbind(top_terms_upreg_1, top_terms_upreg_2, top_terms_upreg_3,
    top_terms_upreg_4)

row.names(top_terms_upreg) <- NULL
kableExtra::kable_paper(knitr::kable(top_terms_upreg, format = "html", digits = 50))
term_id source term_name p_value
GO:0006954 GO:BP inflammatory response 1.290966e-08
GO:0016477 GO:BP cell migration 3.190944e-07
KEGG:01100 KEGG Metabolic pathways 1.325951e-05
KEGG:00520 KEGG Amino sugar and nucleotide sugar metabolism 2.387054e-02
REAC:R-HSA-6798695 REAC Neutrophil degranulation 7.400413e-25
REAC:R-HSA-168249 REAC Innate Immune System 1.353385e-15
WP:WP2882 WP Nuclear receptors meta-pathway 3.994737e-04
WP:WP2880 WP Glucocorticoid receptor pathway 2.913073e-03

Table 8. Top two pathways (ordered by ascending p-values) returned by each of the four data sources in up-regulated genes.

3.4 Enrichment Analysis With Down-Regulated Genes Only

gp_downreg <- gprofiler2::gost(query = rownames(downreg_genes), organism = "hsapiens",
    sources = c("GO:BP", "KEGG", "REAC", "WP"), significant = TRUE, user_threshold = 0.05,
    correction_method = "fdr", exclude_iea = TRUE)

Interactive Manhattan Plot of g:Profiler Enrichment Analysis of All Significant Down-Regulated Genes

Figure 11. Interactive Manhattan plot of g:Profiler output of all significant down-regulated genes, thresholded based on p-value (FDR) < 0.05 and logFC < -log(2). The enrichment analysis is done from 4 data sources: GO:BP, KEGG, REAC, WP, as labeled by color.

gp_downreg_result <- gp_downreg$result[order(gp_downreg$result$p_value, decreasing = FALSE),
    ]
gp_downreg_result <- subset(gp_downreg_result, select = c("term_id", "source", "term_name",
    "p_value"))
top_terms_downreg_1 <- gp_downreg_result[which(gp_downreg_result$source == "GO:BP"),
    ][top_n, ]
top_terms_downreg_2 <- gp_downreg_result[which(gp_downreg_result$source == "KEGG"),
    ][top_n, ]
top_terms_downreg_3 <- gp_downreg_result[which(gp_downreg_result$source == "REAC"),
    ][top_n, ]
top_terms_downreg_4 <- gp_downreg_result[which(gp_downreg_result$source == "WP"),
    ][top_n, ]

top_terms_downreg <- rbind(top_terms_downreg_1, top_terms_downreg_2, top_terms_downreg_3,
    top_terms_downreg_4)

row.names(top_terms_downreg) <- NULL
kableExtra::kable_paper(knitr::kable(top_terms_downreg, format = "html", digits = 30))
term_id source term_name p_value
GO:0023052 GO:BP signaling 1.203247e-22
GO:0007154 GO:BP cell communication 2.299139e-22
KEGG:04640 KEGG Hematopoietic cell lineage 2.063920e-07
KEGG:05310 KEGG Asthma 3.664767e-05
REAC:R-HSA-1280215 REAC Cytokine Signaling in Immune system 7.253035e-08
REAC:R-HSA-162582 REAC Signal Transduction 6.763319e-06
WP:WP4884 WP Pathogenesis of SARS-CoV-2 mediated by nsp9-nsp10 complex 2.876630e-04
WP:WP69 WP T-cell receptor signaling pathway 4.961292e-03

Table 9. Top two pathways (ordered by ascending p-values) returned by each of the four data sources in down-regulated genes.

3.5 Thresholded Over-Representation Analysis Questions

1. Which method did you choose and why?

I chose to use g:Profiler via the gprofiler2 package. I chose g:Profiler because it is an up-to-date tool for functional enrichment analysis, with many of the mainstream data source readily available, allowing for comparisons between sources. I use the gprofiler2 package to submit queries directly from R, instead of manually submitting the gene list using the web tool and copy-pasting the results. This also allows for extensibility in my codes, so that the outputs are automatically updated whenever I make a change to the up-stream model. The thresholding is done via the fisher exact test and corrected using the Benjamini-Hochberg FDR procedure, which, as explained in the previous section (Differential Expression Questions), are mainstream choices for thresholding significant findings.

2. What annotation data did you use and why? What version of the annotation are you using?

I used GO Biological Pathway (GO:BP), Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome (REAC), and Wikipathways (WK). All four of them are top-choice data sources that are being regularly updated. Out of the four, I have experience with GO:BP, REAC, and WK in the g:Profiler homework exercise, and I also added KEGG because it is another up-to-date source on biological pathways and it’s readily available in g:Profiler. I used 4 sources and extracted the top two terms for each source to avoid potential biases that may exist in the data sources.

All data sources are in the latest version that is available in the g:Profiler:

Version
GO:BP annotations: BioMart classes: releases/2022-12-04
KEGG KEGG FTP Release 2022-12-26
REAC annotations: BioMart classes: 2022-12-28
WP 20221210

Table 10. Version information of the 4 data sources used in g:Profiler enrichment analysis.

3. How many genesets were returned with what thresholds?

We set the threshold to be corrected p-value (FDR) < 0.05 and a fold change > 2 (abs(logFC) > log(2)). This threshold is higher than the example in class since I am working with a substantial set of genes and many of them have very high fold changes, as shown in the Model #1 Volcano Plot. I wish to extract the major biological pathways that were impacted by the differential expressions (as a result of the dexamethasone treatment), without the minor pathways convoluting the outputs.

The thresholded dataset contains 1443 genes (both up-regulated and down-regulated), and returned a total of 834 significant (p-value (FDR) < 0.05) genesets out of the 4 data sources.

4. Run the analysis using the up-regulated set of genes, and the down-regulated set of genes separately. How do these results compare to using the whole list (i.e all differentially expressed genes together vs. the up-regulated and down regulated differentially expressed genes separately)?

summary_df <- data.frame(top_terms_de$source, top_terms_de$term_name, formatC(top_terms_de$p_value,
    format = "e", digits = 2), top_terms_upreg$term_name, formatC(top_terms_upreg$p_value,
    format = "e", digits = 2), top_terms_downreg$term_name, formatC(top_terms_downreg$p_value,
    format = "e", digits = 2))
colnames(summary_df) <- c("data_source", "top_terms_whole", "p_val_whole", "top_terms_up-reg",
    "p_val_up-reg", "top_terms_down-reg", "p_val_down-reg")

knitr::kable(summary_df, format = "html", digits = 30) %>%
    kableExtra::kable_paper()
data_source top_terms_whole p_val_whole top_terms_up-reg p_val_up-reg top_terms_down-reg p_val_down-reg
GO:BP immune system process 9.87e-28 inflammatory response 1.29e-08 signaling 1.20e-22
GO:BP signaling 2.93e-22 cell migration 3.19e-07 cell communication 2.30e-22
KEGG Hematopoietic cell lineage 1.65e-05 Metabolic pathways 1.33e-05 Hematopoietic cell lineage 2.06e-07
KEGG Asthma 1.14e-04 Amino sugar and nucleotide sugar metabolism 2.39e-02 Asthma 3.66e-05
REAC Immune System 7.35e-15 Neutrophil degranulation 7.40e-25 Cytokine Signaling in Immune system 7.25e-08
REAC Neutrophil degranulation 6.70e-09 Innate Immune System 1.35e-15 Signal Transduction 6.76e-06
WP Pathogenesis of SARS-CoV-2 mediated by nsp9-nsp10 complex 4.21e-03 Nuclear receptors meta-pathway 3.99e-04 Pathogenesis of SARS-CoV-2 mediated by nsp9-nsp10 complex 2.88e-04
WP Nuclear receptors meta-pathway 2.34e-02 Glucocorticoid receptor pathway 2.91e-03 T-cell receptor signaling pathway 4.96e-03

Table 11. Top two terms returned by each of the four data sources for each of the three sets of genes (all differentially expressed, up-regulated, down-regulated). This table is a merge of Table 7, 8, and 9.

  • Up-regulated: contains 568 genes and returned a total of 371 significant (p-value (FDR) < 0.05) genesets out of the 4 data sources.

  • Down-regulated: contains 875 genes and returned a total of 595 significant (p-value (FDR) < 0.05) genesets out of the 4 data sources.

From Table 11, we can see that there is a major theme among the top terms that are returned: immune system process, inflammatory response, Cytokine Signaling, and Neutrophil degranulation are all pathways related to the immune system. Other smaller categories include viral infections and hematopoietic cell lineage.

Comparing between results obtained from the whole list versus those obtained from the up-regulated or down-regulated set of genes only, we can see that most of the top terms are related (e.g. falls under the same category), if not the same.


4 Interpretation

Background and pre-processing of the data set can be found here: Background on the Data Set

Questions from previous sections can be found here:

1. Do the over-representation results support conclusions or mechanism discussed in the original paper?

Yes, the over-representation results do support some the mechanisms discussed in the original paper. Aside from the obvious pathways like “Glucocorticoid receptor pathway” (WP:WP2880) and “Hematopoietic cell lineage” (KEGG:04640), which directly relates to the background of this data set (analyzing the effect of glucocorticoids (dexamethasone) in human hematopoietic stem cells), there are several additional results that are supported. In the original paper, the authors were investigating the effect of glucocorticoids (GC) on helper innate lymphoid cells (hILCs). hILCs are characterized by the high level of cytokines that they produce, and the authors concluded that GC suppresses hILC development and activity by observing a lower level of cytokines in the presence of GC. [5] In our enrichment analyses result, “Cytokine Signaling in Immune system” (REAC:R-HSA-1280215) is the top term returned by Reactome for the set of down-regulated genes, which matches with the author’s findings.

Other than cytokine signaling, the paper also mentioned that GC has anti-inflammatory properties, which is supported by “inflammatory response” (GO:0006954) in our result for the up-regulated genes, as well as a few other more general pathways related to immune system responses.

2. Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your results.

I searched for more details on the mechanisms of GC on suppressing immune response to provide explanations for some of my results:


References

1. Xie Y. Knitr: A general-purpose package for dynamic report generation in r. 2023.
2. haozhu233. Haozhu233/kableextra: Construct complex table with knitr::kable() + pipe.
3. Ruth I. Lecture 6 - differential expression. 2022.
4. Ruth I. Lecture 7 - annotation resources and prelim ORA. 2022.
5. Quatrini L, Tumino N, Besi F, Ciancaglini C, Galaverna F, Grasso AG, et al. Glucocorticoids inhibit human hematopoietic stem cell differentiation toward a common ILC precursor. Journal of Allergy and Clinical Immunology. 2022;149:1772–85.
6. Quatrini L, Tumino N, Besi F, Ciancaglini C, Galaverna F, Grasso AG, et al. GSE186950: Glucocorticoids inhibit human hematopoietic stem cell differentiation toward a common ILC precursor. 2021.
7. Durinck S, Spellman PT, Birney E, Huber W. Mapping identifiers for the integration of genomic datasets with the r/bioconductor package biomaRt. Nature Protocols. 2009;4:1184–91.
8. Robinson MD, McCarthy DJ, Smyth GK. edgeR: A bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2009;26:139–40.
9. Gu Z, Eils R, Schlesner M. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics. 2016;32:2847–9.
10. Gu Z, Gu L, Eils R, Schlesner M, Brors B. Circlize implements and enhances circular visualization in r. Bioinformatics. 2014;30:2811–2.
11. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research. 2015;43:e47–7.
12. Gamboa-Tuz SD. Visualization of volcano plots in r.
13. Wickham H, François R, Henry L, Müller K. Dplyr: A grammar of data manipulation. 2018.
14. Wickham H. ggplot2: Elegant graphics for data analysis. Springer-Verlag New York; 2016.
15. Slowikowski K. Ggrepel: Automatically position non-overlapping text labels with ’ggplot2’. 2022.
16. Korthauer K, Kimes PK, Duvallet C, Reyes A, Subramanian A, Teng M, et al. A practical guide to methods controlling false discoveries in computational biology. Genome Biology. 2019;20.
17. Reimand J, Kull M, Peterson H, Hansen J, Vilo J. G:profilera web-based toolset for functional profiling of gene lists from large-scale experiments. Nucleic Acids Research. 2007;35:W193–200.
18. Kolberg L, Raudvere U, Kuzmin I, Vilo J, Peterson H. gprofiler2 – an r package for gene list functional enrichment analysis and namespace conversion toolset g:profiler. F1000Research. 2020;9:709.
19. Laethem FV, Baus E, Smyth LA, Andris F, Bex F, Urbain J, et al. Glucocorticoids attenuate t cell receptor signaling. Journal of Experimental Medicine. 2001;193:803–14.
20. He Y, Shi J, Yi W, Ren X, Gao X, Li J, et al. Discovery of a highly potent glucocorticoid for asthma treatment. Cell Discovery. 2015;1.
21. Shacham EC, Ishay A. New insights on effects of glucocorticoids in patients with SARS-CoV-2 infection. Endocrine Practice. 2022;28:1100–6.